Lecture 2

This lecture, taught by Prof. Cathy Yi-Hsuan Chen, introduces importing packages, reading and writing files, and using pandas to read and write structured data.

The code for this lecture can be found on GitHub.

Outlines


Pandas

  • Pandas contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python
  • It provides two workhorse data structures: Series and DataFrame.

Series

  • Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
import pandas as pd
Series1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
  • You can create a series from a dictionary object
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
Series3 = pd.Series(sdata)
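
As a small illustration of how the index behaves, the sketch below reuses the two Series above. The lowercase variable names and the extra label 'California' are assumptions made for this example only.

```python
import pandas as pd

# A Series with an explicit index of labels
series1 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Values are looked up by label, not by position
value_a = series1['a']

# Boolean filtering keeps the labels of the matching values
positive = series1[series1 > 0]

# A Series built from a dict uses the dict keys as its index
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
series3 = pd.Series(sdata)

# Passing an index selects and orders the keys;
# a label that is missing from the dict ('California') becomes NaN
series4 = pd.Series(sdata, index=['California', 'Ohio', 'Texas'])
```

Because the index travels with the data, filtering or reordering a Series never loses track of which label belongs to which value.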

DataFrame

  • DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
  • DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index.
  • There are numerous ways to construct a DataFrame
# Way 1: a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame1 = pd.DataFrame(data)

# you can order the columns
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
# Way 2: a nested dict of dicts format
data = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(data)
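
To make the "dict of Series" picture concrete, here is a minimal sketch based on the nested dict from Way 2. The helper variable names are illustrative assumptions, not part of the lecture code.

```python
import pandas as pd

data = {'Nevada': {2001: 2.4, 2002: 2.9},
        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(data)

# Outer dict keys become the columns, inner keys become the row index
columns = list(frame3.columns)
rows = sorted(frame3.index)

# Each column is a Series that shares the DataFrame's row index
ohio = frame3['Ohio']
shares_index = ohio.index.equals(frame3.index)

# Nevada has no entry for 2000, so that cell is filled with NaN
nevada_2000_missing = pd.isna(frame3.loc[2000, 'Nevada'])
```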
  • Generate a date series and use it as the row index
import numpy as np  # needed for the random data below
dates = pd.date_range('1/1/2020', periods=10)
df = pd.DataFrame(np.random.randn(10, 5), index=dates, columns=['A', 'B', 'C', 'D', 'E'])
  • Indexing, selection, and filtering in DataFrame
import numpy as np  # NumPy is the fundamental package for scientific computing in Python
data = pd.DataFrame(np.random.randn(4, 4),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])

# please try the following selections
data['two']

data['two'].iloc[0]  # first value by position; data['two'][0] is deprecated for a label index

data[['three', 'one']]

data[:2]

data[2:]

data[data['three'] > 0.1]

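# please try the following filtering
data[data < 0.1] = 0

# .ix was removed in pandas 1.0; use .loc for labels (or .iloc for positions)
data.loc['Colorado', ['two', 'three']]

data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

data.loc[data.three > 0.1, data.columns[:3]]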

# loc: Access a group of rows and columns by label(s) or a boolean array.
data.loc['Ohio']

# Single label for row and column
data.loc['Ohio','three']

# iloc: Purely integer-location based indexing for selection by position
data.iloc[0]
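
The difference between label-based and position-based indexing is easiest to see on fixed values. The sketch below is an illustration only: it swaps the random numbers for np.arange(16) (an assumption for reproducibility) and keeps the same row and column labels.

```python
import numpy as np
import pandas as pd

# Same labels as above, but with fixed values 0..15 so results are predictable
data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc selects by label, iloc by integer position; these pick the same cells
by_label = data.loc['Colorado', ['two', 'three']]
by_position = data.iloc[1, [1, 2]]

# Label slices include both endpoints, position slices exclude the stop
label_slice = data.loc['Ohio':'Utah', 'one']
position_slice = data.iloc[0:2, 0]

# A boolean mask can be combined with a column selection in loc
mask_rows = data.loc[data['three'] > 5, :'two']
```

Note that the label slice returns three rows while the position slice returns two, which is a common source of off-by-one mistakes.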
  • You can drop rows that contain NA values from a DataFrame with dropna().
  • You can replace NA values with 0 using fillna(0).
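
A minimal sketch of both methods on a small, made-up DataFrame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [4.0, 5.0, np.nan]})

# dropna() removes every row that contains at least one NA
dropped = frame.dropna()

# fillna(0) replaces each NA with 0
filled = frame.fillna(0)

# Both methods return new objects; the original frame is unchanged
remaining_na = int(frame.isna().sum().sum())
```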

Data Input/Output

Reading and writing text files

When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
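
One possible way to do this with pandas is the nrows and chunksize arguments of read_csv. The sketch below writes a tiny temporary CSV first so that it runs on its own; the file name example.csv and its columns are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small hypothetical CSV so the sketch is self-contained
csv_path = os.path.join(tempfile.mkdtemp(), 'example.csv')
pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) * 2}).to_csv(csv_path, index=False)

# nrows reads only the first few rows of the file
head = pd.read_csv(csv_path, nrows=3)

# chunksize returns an iterator that yields DataFrames of at most 4 rows,
# so the whole file never has to be held in memory at once
total_y = 0
n_chunks = 0
for chunk in pd.read_csv(csv_path, chunksize=4):
    total_y += int(chunk['y'].sum())
    n_chunks += 1
```

Reading a few rows with nrows is a cheap way to check that arguments such as index_col or parse_dates are right before processing the full file.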

Using AAPL stock data as an example, find the data here and save it in your working directory.

apple_stock = pd.read_csv('AAPL.csv', index_col='date', parse_dates=True)
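# slicing: keep the 2013 rows and the price and volume columns
apple_stock_2013 = apple_stock.loc[apple_stock.index.year == 2013, ['low', 'high', 'open', 'close', 'volume']]
# sorting by volume, largest first (reassigning avoids modifying a slice in place)
apple_stock_2013 = apple_stock_2013.sort_values(by='volume', ascending=False)

# Save the new data in JSON or CSV format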
apple_stock_2013.to_json('AAPL_2013.json')
apple_stock_2013.to_csv('test.csv')

OS module

The os module in Python provides a way of using operating-system-dependent functionality, such as querying and changing the working directory.
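
A few commonly used calls are sketched below. The folder name 'course' is taken from the example in this section and may not exist on your machine, so the sketch only checks for it rather than changing into it.

```python
import os

# Current working directory as an absolute path
cwd = os.getcwd()

# os.path.join builds a path with the correct separator for the current OS
course_dir = os.path.join(cwd, 'course')

# Check whether the folder exists before trying to use it
course_exists = os.path.isdir(course_dir)

# List the names of the files and folders in the working directory
entries = os.listdir(cwd)
```

Building paths with os.path.join instead of string concatenation keeps the same script working on Windows, macOS, and Linux.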

Using a corpus of Shakespeare as an example, find the text here and save it in your working directory.

import os

path_direct = os.getcwd()
os.chdir(os.path.join(path_direct, 'course'))  # move into the 'course' subfolder

# Use the built-in function open() to open the file and close() to close it
shakespeare = open('shakespeare.txt', 'r', encoding='utf-8')
for string in shakespeare:
    print(string)
shakespeare.close()
  • Way 1: read strings
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # read(n) returns the next n characters as a string
    shakespeare_string_10 = shakespeare_read.read(10)
    # read() with no argument returns everything after those first 10 characters
    shakespeare_string = shakespeare_read.read()
  • Way 2: read single line
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # readline() reads one line per call, including its trailing newline
    print(shakespeare_read.readline(), end='*')
    print(shakespeare_read.readline(), end='*')
    print(shakespeare_read.readline(), end='*')
  • Way 3: read multiple lines, and create a list of strings
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # readlines() returns a list in which every line of the file is one string
    shakespeare_lines = shakespeare_read.readlines()
    print(shakespeare_lines)
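
The examples above only read a file. Since this lecture also covers writing, here is a minimal sketch of the writing side with the same built-in open(). The file name example.txt and its contents are hypothetical, and a temporary folder is used so nothing in your working directory is overwritten.

```python
import os
import tempfile

# Hypothetical output file in a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
lines = ['first line\n', 'second line\n']

# Mode 'w' creates the file, or overwrites it if it already exists
with open(out_path, 'w', encoding='utf-8') as out_file:
    out_file.write('header\n')    # write() takes a single string
    out_file.writelines(lines)    # writelines() takes a list of strings

# Mode 'a' appends to the end of an existing file
with open(out_path, 'a', encoding='utf-8') as out_file:
    out_file.write('appended line\n')

# Read the file back to check what was written
with open(out_path, 'r', encoding='utf-8') as in_file:
    read_back = in_file.readlines()
```

Neither write() nor writelines() adds a newline automatically, so each string carries its own '\n'.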

Additional Resources
